Authors: Luis Carlos Ospina, Pau Ribera and Nicolás Costa

Introduction

In this analysis, we explore gene expression changes in HIV-infected and non-infected cell lines treated with SAHA, PMA, and DMSO. The aim is to apply dimensionality reduction techniques (PCA and tSNE) to visualize and understand the biological variability and potential batch effects in the dataset. This initial exploratory step will inform downstream differential expression analysis and biological interpretation.

QUESTION 1: Load, adapt the data and create metadata

Reusable functions

Run PCA and create a dataframe

# this reusable function runs PCA and returns a dataframe with metadata
run_pca_with_metadata <- function(expr_matrix, metadata_df,BOOL_scale) {
  pca_result <- prcomp(expr_matrix, center = TRUE, scale. = BOOL_scale)
  pca_df <- as.data.frame(pca_result$x)
  combined_df <- cbind(pca_df, metadata_df[, c("sample", "cell_type", "treatment", "replicate")])
  return(combined_df)
}

Plot a PCA dataframe

# this reusable function plots a PCA dataframe
plot_pca_reduction <- function(df, x_col = "PC1", y_col = "PC2", title = "", shape_by = "replicate", alpha = 0.8) {
  ggplot(df, aes_string(x = x_col, y = y_col, color = "treatment", shape = shape_by)) +
    geom_point(size = 2, alpha = alpha) +
    scale_shape_manual(values = c("rep1" = 3, "rep2" = 16, "rep3" = 12,
                                  "J-LatA2" = 3, "Jurkat" = 16)) +
    theme_minimal() +
    labs(title = title, x = x_col, y = y_col)
}

Run t-SNE with Metadata

# this reusable function runs tSNE and returns a dataframe with metadata
run_tsne_with_metadata <- function(expr_matrix, metadata_df, perplexity = 30, max_iter = 1000, seed = 42) {
  set.seed(seed)
  tsne_result <- Rtsne(as.matrix(expr_matrix), dims = 2, perplexity = perplexity, max_iter = max_iter)
  tsne_df <- data.frame(tsne_result$Y)
  result <- cbind(tsne_df, metadata_df[, c("sample", "cell_type", "treatment", "replicate")])
  return(result)
}

Plot a tSNE

# this reusable function plots a tSNE dataframe
plot_tsne_reduction <- function(df, x_col = "X1", y_col = "X2", title = "", shape_by = "cell_type", alpha = 0.8) {
  ggplot(df, aes_string(x = x_col, y = y_col, color = "treatment", shape = shape_by)) +
    geom_point(size = 2, alpha = alpha) +
    scale_shape_manual(values = c("rep1" = 3, "rep2" = 16, "rep3" = 12,
                                  "J-LatA2" = 3, "Jurkat" = 16)) +
    theme_minimal() +
    labs(title = title, x = x_col, y = y_col)
}

Run and plot t-SNE

# this is a function to run and plot tSNE with a given parameter
run_and_plot_tsne <- function(expr, metadata, perplexity = 30, max_iter = 1000, seed = 42, title_suffix = "") {
  df <- run_tsne_with_metadata(expr, metadata, perplexity = perplexity, max_iter = max_iter, seed = seed)
  plot_tsne_reduction(df, title = title_suffix, shape_by = "replicate")
}

Run and plot a tSNE

# this is a function to run and plot tSNE with a given parameter
run_and_plot_tsne <- function(expr, metadata, perplexity = 30, max_iter = 1000, seed = 42, title_suffix = "") {
  df <- run_tsne_with_metadata(expr, metadata, perplexity = perplexity, max_iter = max_iter, seed = seed)
  plot_tsne_reduction(df, title = title_suffix)
}

Project new data onto a pre-trained UMAP model

# this reusable function projects new data onto a pre-trained UMAP model
project_umap <- function(train_expr, test_expr, metadata_train, metadata_test,
                         n_neighbors = 15, min_dist = 0.1, metric = "euclidean", seed = 42) {
  set.seed(seed)
  
  # fit UMAP on training data
  umap_model <- umap(as.matrix(train_expr),
                     n_neighbors = n_neighbors,
                     min_dist = min_dist,
                     metric = metric,
                     n_components = 2,
                     ret_model = TRUE)
  
  # embed the training data
  umap_train <- data.frame(umap_model$embedding)
  colnames(umap_train) <- c("UMAP1", "UMAP2")
  umap_train$set <- "train"
  train_combined <- cbind(umap_train, metadata_train)
  
  # project the test data
  umap_test <- data.frame(umap_transform(as.matrix(test_expr), umap_model))
  colnames(umap_test) <- c("UMAP1", "UMAP2")
  umap_test$set <- "test"
  test_combined <- cbind(umap_test, metadata_test)
  
  # merge
  final <- rbind(train_combined, test_combined)
  return(final)
}

Run UMAP with Metadata

# this reusable function runs UMAP and returns a dataframe with metadata
run_umap_with_metadata <- function(expr_matrix, metadata_df, n_neighbors = 15, min_dist = 0.1, metric = "euclidean", seed = 42) {
  set.seed(seed)
  
  umap_result <- umap(as.matrix(expr_matrix), n_neighbors = n_neighbors, min_dist = min_dist, metric = metric, n_components = 2)
  
  umap_df <- data.frame(UMAP1 = umap_result[,1], UMAP2 = umap_result[,2])
  combined_df <- cbind(umap_df, metadata_df[, c("sample", "cell_type", "treatment", "replicate")])
  return(combined_df)
}

Plot a UMAP dataframe

# this reusable function plots a UMAP dataframe
plot_umap_reduction <- function(df, x_col = "UMAP1", y_col = "UMAP2", title = "", shape_by = "cell_type", alpha = 0.8) {
  ggplot(df, aes_string(x = x_col, y = y_col, color = "treatment", shape = shape_by)) +
    geom_point(size = 2, alpha = alpha) +
    scale_shape_manual(values = c("rep1" = 3, "rep2" = 16, "rep3" = 12,
                                  "J-LatA2" = 3, "Jurkat" = 16,
                                  "train" = 16, "test" = 3)) +
    theme_minimal() +
    labs(title = title, x = x_col, y = y_col)
}

Run and plot a UMAP

# this function runs UMAP and returns the plot
run_and_plot_umap <- function(expr, metadata, n_neighbors = 15, min_dist = 0.1, metric = "euclidean", title_suffix = "", shape_by = "replicate") {
  df <- run_umap_with_metadata(expr, metadata, n_neighbors = n_neighbors, min_dist = min_dist, metric = metric)
  plot_umap_reduction(df, title = paste("UMAP -", title_suffix), shape_by = shape_by)
}

Generic plot for dimensionality reduction

# this reusable function plots any dimensionality reduction result (PCA, tSNE, UMAP) shaping by cell type or replicate
plot_dimred <- function(df, x = "PC1", y = "PC2", title = "", shape_by = "cell_type") {
  shape_scale <- switch(shape_by,
    "cell_type" = scale_shape_manual(values = c("J-LatA2" = 3, "Jurkat" = 16)),
    "replicate" = scale_shape_manual(values = c("rep1" = 3, "rep2" = 16, "rep3" = 12)),
    scale_shape_manual(values = NULL)  # fallback
  )

  ggplot(df, aes_string(x = x, y = y, color = "treatment", shape = shape_by)) +
    geom_point(size = 2, alpha = 0.8) +
    shape_scale +
    theme_minimal() +
    labs(title = title, x = x, y = y)
}

Data Loading and Preprocessing

First, we load the raw counts matrix, where genes are rows and samples are columns.
We also load the metadata file, which contains information about each sample encoded in a single string.

Next, we split the metadata string into separate columns: sample name, cell type, treatment group, and replicate number.

We filter the metadata to keep only the samples present in the counts matrix, ensuring consistency.

Then, we reorder the metadata to match the order of samples in the counts matrix.
This step is crucial for correct merging and analysis.

Since the counts matrix has genes as rows and samples as columns, we transpose it to have samples as rows and genes as columns, which is the preferred format for most analyses.

Finally, we merge the metadata with the transposed counts matrix by sample name, creating a combined dataset for downstream analysis.

To check data balance, we plot the distribution of samples by treatment and cell type interactively.

counts <- read.csv("counts.csv", row.names = 1)

metadata_raw <- read.csv("sample_sheet.csv", header = FALSE)

colnames(metadata_raw) <- "info"

metadata <- metadata_raw %>%
  separate(info, into = c("sample", "cell_type", "treatment", "replicate"), sep = ";")

metadata <- metadata %>%
  filter(sample %in% colnames(counts))

metadata <- metadata %>%
  arrange(factor(sample, levels = colnames(counts)))

counts_t <- as.data.frame(t(counts))

counts_t$sample <- rownames(counts_t)

final_data <- inner_join(metadata, counts_t, by = "sample") 

g1 <- ggplot(metadata, aes(x = treatment, fill = cell_type)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(title = "Sample distribution by treatment and cell type",
       x = "Treatment", y = "Number of samples")

ggplotly(g1)

Interpretation:

The barplot shows the number of samples per treatment and cell type. We observe that J-LatA2 cells (HIV-infected) are more represented across all treatments, especially for PMA and SAHA. Jurkat cells (non-infected) are fewer but evenly distributed across treatments. This sample imbalance should be kept in mind for further analysis as it could introduce bias or batch effects in plots like PCA or tSNE.

QUESTION 2: PCA representation

PCA on raw counts

According to cell type
expr_raw <- final_data %>%
  select(-sample, -cell_type, -treatment, -replicate)


# ALTERNATIVE:

#expr_raw <- final_data %>%
#  select(where(is.numeric))  # this automatically selects all gene columns

# to run and plot PCA
pca_df_raw <- run_pca_with_metadata(expr_raw, final_data, FALSE)
plot_pca_reduction(pca_df_raw, title = "PCA on raw counts", shape_by = "cell_type")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

According to replicate
plot_pca_reduction(pca_df_raw, title = "PCA on raw counts")

PCA on normalized data

According to cell type
# PCA on scaled and normalized data:
pca_df_scaled <- run_pca_with_metadata(expr_raw, final_data, TRUE)
plot_pca_reduction(pca_df_scaled, title = "PCA on scaled data", shape_by = "cell_type")

According to replicate
# PCA on scaled data:
plot_pca_reduction(pca_df_scaled, title = "PCA on scaled data")

PCA on scaled normalized data

According to replicate
# to normalize each sample by its total expression (relative expression)
norm_expr <- expr_raw/rowSums(expr_raw)


pca_df_norm <- run_pca_with_metadata(norm_expr, final_data, TRUE)

# PCA scaled and normalized
plot_pca_reduction(pca_df_norm, "PC1", "PC2", title = "Scaled normalized data")

According to cell type
plot_pca_reduction(pca_df_norm, "PC1", "PC2", title = "Raw data", shape_by = "cell_type")

Correlation computation

According to treatment
According to normalized data
# Correlation computation:

summary(lm(PC1 ~ treatment, data = pca_df_norm))
## 
## Call:
## lm(formula = PC1 ~ treatment, data = pca_df_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.8340  -3.7037   0.3409   3.7078  18.2998 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.3769     0.8142  -5.375 1.66e-07 ***
## treatmentPMA  -14.5575     0.9828 -14.812  < 2e-16 ***
## treatmentSAHA  28.1843     1.0037  28.079  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.928 on 268 degrees of freedom
## Multiple R-squared:  0.9143, Adjusted R-squared:  0.9136 
## F-statistic:  1429 on 2 and 268 DF,  p-value: < 2.2e-16
summary(lm(PC2 ~ treatment, data = pca_df_norm))
## 
## Call:
## lm(formula = PC2 ~ treatment, data = pca_df_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.796 -11.412   4.384  11.041  61.767 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.890      2.234  -4.428 1.39e-05 ***
## treatmentPMA    11.335      2.696   4.204 3.57e-05 ***
## treatmentSAHA   13.386      2.753   4.862 1.98e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.26 on 268 degrees of freedom
## Multiple R-squared:  0.08606,    Adjusted R-squared:  0.07924 
## F-statistic: 12.62 on 2 and 268 DF,  p-value: 5.79e-06
According to raw data
summary(lm(PC1 ~ treatment, data = pca_df_raw))
## 
## Call:
## lm(formula = PC1 ~ treatment, data = pca_df_raw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16259  -8156   -383   5017  43564 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2503.7     1388.3  -1.803 0.072453 .  
## treatmentPMA     592.2     1675.7   0.353 0.724079    
## treatmentSAHA   5978.5     1711.4   3.493 0.000558 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10110 on 268 degrees of freedom
## Multiple R-squared:  0.06769,    Adjusted R-squared:  0.06073 
## F-statistic: 9.728 on 2 and 268 DF,  p-value: 8.343e-05
summary(lm(PC2 ~ treatment, data = pca_df_raw))
## 
## Call:
## lm(formula = PC2 ~ treatment, data = pca_df_raw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16271.7  -3295.9    557.2   2431.7  17296.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1468.6      618.2   2.376   0.0182 *  
## treatmentPMA    3142.4      746.2   4.211 3.47e-05 ***
## treatmentSAHA  -7475.7      762.1  -9.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4501 on 268 degrees of freedom
## Multiple R-squared:  0.5356, Adjusted R-squared:  0.5322 
## F-statistic: 154.6 on 2 and 268 DF,  p-value: < 2.2e-16
According to Cell type
According to normalized data
summary(lm(PC1 ~ cell_type, data = pca_df_norm))
## 
## Call:
## lm(formula = PC1 ~ cell_type, data = pca_df_norm)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.17 -16.00  -5.58  22.52  40.91 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -0.5958     1.5006  -0.397    0.692
## cell_typeJurkat   1.7940     2.6039   0.689    0.491
## 
## Residual standard error: 20.19 on 269 degrees of freedom
## Multiple R-squared:  0.001761,   Adjusted R-squared:  -0.00195 
## F-statistic: 0.4747 on 1 and 269 DF,  p-value: 0.4914
summary(lm(PC2 ~ cell_type, data = pca_df_norm))
## 
## Call:
## lm(formula = PC2 ~ cell_type, data = pca_df_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.696 -12.499   3.546  12.794  60.027 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.604      1.231  -2.115 0.035390 *  
## cell_typeJurkat    7.841      2.137   3.669 0.000293 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.57 on 269 degrees of freedom
## Multiple R-squared:  0.04766,    Adjusted R-squared:  0.04412 
## F-statistic: 13.46 on 1 and 269 DF,  p-value: 0.0002932
According to raw data
summary(lm(PC1 ~ cell_type, data = pca_df_raw))
## 
## Call:
## lm(formula = PC1 ~ cell_type, data = pca_df_raw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13165  -9021  -1632   4921  46658 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -189.5      776.3  -0.244    0.807
## cell_typeJurkat    570.5     1347.2   0.424    0.672
## 
## Residual standard error: 10440 on 269 degrees of freedom
## Multiple R-squared:  0.0006663,  Adjusted R-squared:  -0.003049 
## F-statistic: 0.1794 on 1 and 269 DF,  p-value: 0.6723
summary(lm(PC2 ~ cell_type, data = pca_df_raw))
## 
## Call:
## lm(formula = PC2 ~ cell_type, data = pca_df_raw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22450  -4120  -1331   4414  22253 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)        171.6      489.7   0.350    0.726
## cell_typeJurkat   -516.6      849.7  -0.608    0.544
## 
## Residual standard error: 6588 on 269 degrees of freedom
## Multiple R-squared:  0.001372,   Adjusted R-squared:  -0.00234 
## F-statistic: 0.3696 on 1 and 269 DF,  p-value: 0.5437
According to Replicate
According to normalized data
summary(lm(PC1 ~ replicate, data = pca_df_norm))
## 
## Call:
## lm(formula = PC1 ~ replicate, data = pca_df_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.854 -16.489  -5.463  22.180  42.704 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      3.570      2.115   1.688   0.0926 .
## replicaterep2   -6.484      2.974  -2.180   0.0301 *
## replicaterep3   -4.166      2.999  -1.389   0.1659  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.06 on 268 degrees of freedom
## Multiple R-squared:  0.01784,    Adjusted R-squared:  0.01051 
## F-statistic: 2.435 on 2 and 268 DF,  p-value: 0.08957
summary(lm(PC2 ~ replicate, data = pca_df_norm))
## 
## Call:
## lm(formula = PC2 ~ replicate, data = pca_df_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.483  -5.926   0.425   7.216  56.331 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -18.221      1.166  -15.63   <2e-16 ***
## replicaterep2   27.403      1.640   16.71   <2e-16 ***
## replicaterep3   27.153      1.654   16.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.06 on 268 degrees of freedom
## Multiple R-squared:  0.577,  Adjusted R-squared:  0.5739 
## F-statistic: 182.8 on 2 and 268 DF,  p-value: < 2.2e-16
According to raw data
summary(lm(PC1 ~ replicate, data = pca_df_raw))
## 
## Call:
## lm(formula = PC1 ~ replicate, data = pca_df_raw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22143  -2389   -220   1911  37096 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     9943.0      703.3   14.14   <2e-16 ***
## replicaterep2 -10197.7      989.1  -10.31   <2e-16 ***
## replicaterep3 -19734.5      997.3  -19.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6672 on 268 degrees of freedom
## Multiple R-squared:  0.5938, Adjusted R-squared:  0.5907 
## F-statistic: 195.9 on 2 and 268 DF,  p-value: < 2.2e-16
summary(lm(PC2 ~ replicate, data = pca_df_raw))
## 
## Call:
## lm(formula = PC2 ~ replicate, data = pca_df_raw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23903.8  -3405.8    546.4   3741.0  20282.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1625.0      672.4   2.417   0.0163 *  
## replicaterep2   -918.5      945.7  -0.971   0.3323    
## replicaterep3  -3998.6      953.6  -4.193 3.74e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6379 on 268 degrees of freedom
## Multiple R-squared:  0.06715,    Adjusted R-squared:  0.06019 
## F-statistic: 9.646 on 2 and 268 DF,  p-value: 9.008e-05
According to Total Expression
According to normalized data
pca_df_norm$total_expression <- rowSums(expr_raw)
summary(lm(PC1 ~ total_expression, data = pca_df_norm))
## 
## Call:
## lm(formula = PC1 ~ total_expression, data = pca_df_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.313 -15.687  -6.279  21.494  45.374 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      -4.877e+00  2.333e+00  -2.090   0.0375 *
## total_expression  1.414e-05  5.778e-06   2.448   0.0150 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.99 on 269 degrees of freedom
## Multiple R-squared:  0.02179,    Adjusted R-squared:  0.01815 
## F-statistic: 5.992 on 1 and 269 DF,  p-value: 0.01501
summary(lm(PC2 ~ total_expression, data = pca_df_norm))
## 
## Call:
## lm(formula = PC2 ~ total_expression, data = pca_df_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.043  -8.068   0.891   9.076  57.273 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.193e+01  1.790e+00   6.666 1.48e-10 ***
## total_expression -3.460e-05  4.432e-06  -7.806 1.32e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.33 on 269 degrees of freedom
## Multiple R-squared:  0.1847, Adjusted R-squared:  0.1816 
## F-statistic: 60.93 on 1 and 269 DF,  p-value: 1.319e-13
According to raw data
pca_df_raw$total_expression <- rowSums(expr_raw)
summary(lm(PC1 ~ total_expression, data = pca_df_raw))
## 
## Call:
## lm(formula = PC1 ~ total_expression, data = pca_df_raw)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9788  -2354    116   1477  15789 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.590e+04  4.457e+02  -35.67   <2e-16 ***
## total_expression  4.612e-02  1.104e-03   41.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3818 on 269 degrees of freedom
## Multiple R-squared:  0.8665, Adjusted R-squared:  0.866 
## F-statistic:  1745 on 1 and 269 DF,  p-value: < 2.2e-16
summary(lm(PC2 ~ total_expression, data = pca_df_raw))
## 
## Call:
## lm(formula = PC2 ~ total_expression, data = pca_df_raw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23062.8  -2974.0    275.6   4041.1  18987.7 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.543e+03  7.479e+02  -3.400 0.000776 ***
## total_expression  7.374e-03  1.852e-03   3.982 8.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6406 on 269 degrees of freedom
## Multiple R-squared:  0.05566,    Adjusted R-squared:  0.05214 
## F-statistic: 15.85 on 1 and 269 DF,  p-value: 8.812e-05

Batch effect

total_expression <- rowSums(expr_raw)

# Define color palette for replicates (experiments)
color_palette <- c("rep1" = "black", "rep2" = "red", "rep3" = "blue")  # customize as needed

# Assign colors by replicate
colors <- color_palette[metadata$replicate]

# Plot
plot(total_expression, 
     type = "h", 
     col = colors,
     xlab = "Samples",
     ylab = "Total Gene Expression",
     main = "Total Gene Expression by Replicate")

# Legend
legend("topright", legend = names(color_palette), col = color_palette, pch = 15)

PCA interpretation:

In the initial principal component analysis (PCA) of the raw data, a pronounced batch effect was observed. Samples clustered primarily according to the replicate variable, indicating that technical variation was a dominant source of variation. This was visually evident in the PCA plots and further confirmed by the “Batch Effect” plot, which showed total gene expression levels varying systematically across replicates.

Correlation analysis reinforced this observation. In the replicate-specific tab, the raw data showed a statistically significant relationship between replicate and both PCA axes. In particular, PC1 displayed extremely large coefficient values (e.g., -19,734.5 and -10,197.7), highlighting the strong influence of technical replicate effects.

After normalization, the batch effect was substantially reduced. In the updated linear models, p-values became either non-significant or, when still significant, were associated with much smaller coefficients (e.g., 27.4 and 27.1). This confirmed that replicate-driven variability was greatly diminished, which is also reflected in the normalized PCA plots.

Importantly, PC2 showed a very strong association with treatment, confirming that this component captures the main treatment-driven variability in gene expression. PC1 was also significantly associated with treatment, although to a lesser extent. Furthermore, PC2 reflects differences between Jurkat and J-LatA2 cells, suggesting it captures meaningful biological variation across cell types.

In summary, PCA effectively captured both technical and biological sources of variability. PC1 primarily reflects treatment effects and a residual replicate influence, while PC2 clearly separates the data by both treatment and cell type. This supports the use of PCA as a meaningful tool for biological interpretation and confirms that normalization greatly improved the interpretability of the data by reducing batch effects and enhancing biological signal.

Additional PCA visualizations

Scree plot using factoextra

# ADDITIONAL PLOT: A scree plot using factoextra

# this visualizes the variance explained by each component using factoextra
fviz_eig(prcomp(norm_expr, center = TRUE, scale. = FALSE), addlabels = TRUE, barfill = "steelblue", barcolor = "black") +
  theme_minimal() +
  labs(title = "Scree plot using factoextra")

Scree plot interpretation

The scree plot displays the percentage of variance explained by each principal component (PC). We observe that:

PC1 explains 33.8% of the total variance,

PC2 explains 14.9%, and

PC3 explains 10.2%.

Together, the first three components capture nearly 59% of the total variance, indicating that a substantial amount of the structured variability in the dataset is concentrated in these first few components.

This steep drop in explained variance between the first and subsequent components suggests a clear elbow point after PC3, meaning that additional components contribute progressively less meaningful information and are more likely to capture noise or minor variation.

Despite the diminishing variance explained beyond the third component, the first two to three PCs are sufficient for capturing key patterns in the data, such as differences in treatment and cell type, as seen in corresponding PCA plots. This supports their use in data visualization and initial exploratory analysis.

Overall, the scree plot confirms that most of the biologically relevant variation is concentrated in the first few principal components, justifying the focus on them for downstream interpretation.

PCA Biplot

# ADDITIONAL PLOTS: Biplots using factoextra

# the biplot coloring by treatment
fviz_pca_biplot(prcomp(norm_expr, center = TRUE, scale. = FALSE), 
                geom.ind = "point", col.ind = final_data$treatment, 
                palette = "jco", addEllipses = TRUE, 
                label = "none") +
  theme_minimal() +
  labs(title = "PCA Biplot with Treatment Coloring")

# the biplot coloring by cell type
fviz_pca_biplot(prcomp(norm_expr, center = TRUE, scale. = FALSE),  
                geom.ind = "point", col.ind = final_data$cell_type,  
                palette = "jco", addEllipses = TRUE,  
                label = "none") +
  labs(title = "PCA Biplot with Cell Type Coloring")

# and the biplot coloring by replicate
fviz_pca_biplot(prcomp(norm_expr, center = TRUE, scale. = FALSE),  
                geom.ind = "point", col.ind = final_data$replicate,  
                palette = "jco", addEllipses = TRUE,  
                label = "none") +
  labs(title = "PCA Biplot with Replicate Coloring")

PCA Biplot interpretation

To better understand the contribution of genes and the relationships between samples, we generated PCA biplots using scaled expression data. We colored the samples by treatment, cell type, and replicate to explore different sources of variability.

We observed that treatment is a major driver of separation along PC2, as SAHA, PMA, and DMSO form well-defined clusters. This aligns with the linear model results showing a strong treatment effect on PC2. Cell type (J-LatA2 vs. Jurkat) shows moderate separation, primarily along PC1. This suggests that PC1 captures the variability between infected and non-infected cells. Replicates are not completely overlapping, indicating the presence of some batch effects, especially in PC1. However, the clusters remain tight within treatments, showing that the biological signal dominates.

The arrows in the biplot represent genes that contribute most to PC1 and PC2. Their direction indicates which genes are more expressed in which sample groups. For example, arrows pointing towards SAHA suggest genes that are upregulated under that treatment.

Interactive expression patterns of top 5 variable genes

# this calculates the variance of each gene
gene_variances <- apply(norm_expr, 2, var) # the "2" here is to apply the "var" function column by column

# this selects the top 5 most variable genes
top_genes <- names(sort(gene_variances, decreasing = TRUE)[1:5])

expr_long <- melt(cbind(final_data[, c("treatment", "cell_type")], norm_expr[, top_genes]),
                  id.vars = c("treatment", "cell_type")) # this change in the shape of the data is very useful here, because visualization functions like ggplot2 work much better when the data is in long format


ggplotly(
  ggplot(expr_long, aes(x = treatment, y = value, fill = treatment,
                        text = paste("Treatment:", treatment,
                                     "<br>Cell type:", cell_type,
                                     "<br>Gene:", variable,
                                     "<br>Expression:", round(value, 2)))) +
    geom_boxplot(alpha = 0.7) +
    facet_wrap(~ variable, scales = "free_y") +
    theme_minimal() +
    labs(title = "Top 5 most variable genes (interactive)", y = "log2 expression", x = "Treatment"),
  tooltip = "text"
)

Expression of top variable genes interpretation

To better understand which genes drive the differences observed in PCA, we plotted the expression of the top 5 most variable genes across treatments. We observe that some genes (e.g., ENSG00000274752.1 or ENSG00000110848.8) show clear differences between treatment groups, and this suggests treatment-specific regulation. This means that if a gene has a high expression with some treatment and low expression with some other treatment, the treatment in which it has a high expression activates that gene, and the other treatments do not.

This supports the PCA results, where treatment explained a large portion of the variability. We think that is important to include this type of gene-level expression plots, since it helps to connect the global patterns seen in PCA with specific biological signals.

QUESTION 3: tSNE representation

tSNE on raw data

# tSNE on raw data:
tsne_df_raw <- run_tsne_with_metadata(expr_raw, final_data)
plot_tsne_reduction(tsne_df_raw, title = "tSNE on raw data")

plot_tsne_reduction(tsne_df_raw, title = "tSNE on raw data", shape_by = "replicate")

tSNE on normalized data

# tSNE on normalized data:
tsne_df_norm <- run_tsne_with_metadata(norm_expr, final_data)
plot_tsne_reduction(tsne_df_norm, title = "tSNE on normalized data")

plot_tsne_reduction(tsne_df_norm, title = "tSNE on normalized data", shape_by = "replicate")

Interactive tSNE representation (normalized)

# interactive tSNE plot (scaled data)
tsne_plot <- ggplot(tsne_df_norm, aes(x = X1, y = X2, color = treatment, shape = replicate,
                                        text = paste("Sample:", sample,
                                                     "<br>Treatment:", treatment,
                                                     "<br>Cell type:", cell_type))) +
  geom_point(size = 2, alpha = 0.8) +
  scale_shape_manual(values = c("rep1" = 3, "rep2" = 16, "rep3" = 12)) +
  theme_minimal() +
  labs(title = "Interactive tSNE (scaled data)", x = "tSNE-1", y = "tSNE-2")

ggplotly(tsne_plot, tooltip = "text")

tSNE interpretation:

We applied tSNE to both raw and normalized gene expression data in order to visualize the structure of the dataset and assess whether normalization improves biological signal detection.

First, we ran tSNE on the raw count matrix. Although some separation between treatment groups was visible, the overall structure was noisy and less defined. Replicates appeared more dispersed within their respective groups, especially for PMA and DMSO, suggesting that technical variability or batch effects might be present in the unnormalized data.

Next, we performed normalization and scaling of the expression matrix before rerunning tSNE. This preprocessing improved the separation between treatment groups and enhanced the internal consistency of replicates. In the normalized tSNE plot, replicates clustered tightly within each treatment and cell type, indicating that normalization successfully reduced batch effects and highlighted biologically meaningful patterns.

We did not observe strong outliers in the dataset, and all samples contributed to well-formed clusters.

To explore possible associations between the tSNE dimensions and metadata, we evaluated the visual alignment of clusters with treatment and cell type. The tSNE axes reflect these metadata variables, as samples grouped according to treatment (DMSO, PMA, SAHA) and cell type (Jurkat, J-LatA2). However, unlike PCA, tSNE does not produce axes with interpretable linear relationships to the original data. For this reason, we did not perform any linear model on the tSNE coordinates. Instead, we did a visual inspection of clustering patterns, which clearly indicated that tSNE captured the main sources of biological variability in the dataset.

In conclusion, the comparison between raw and normalized data confirms that preprocessing was essential for reducing technical variation and improving the interpretability of the dataset in the tSNE projection.

QUESTION 4: tSNE parameters

# defining the values to test
perplexities <- c(5, 30, 50)
iterations <- c(200, 500, 1000)
seeds <- c(21, 42, 99)

# Varying Perplexity (5, 30, 50):

# this tests different values of perplexity to compare local vs global structure

perplexity_plots <- lapply(perplexities, function(p) {
  run_and_plot_tsne(norm_expr, final_data, perplexity = p, title_suffix = paste("Perp =", p))
})

# Varying Iterations (200, 500, 1000):

# this tests how many iterations are needed for good convergence

iteration_plots <- lapply(iterations, function(i) {
  run_and_plot_tsne(norm_expr, final_data, max_iter = i, title_suffix = paste("Iter =", i))
})

# Varying Seed (21, 42, 99):

# this tests whether changing the seed affects the biological interpretation

seed_plots <- lapply(seeds, function(s) {
  run_and_plot_tsne(norm_expr, final_data, seed = s, title_suffix = paste("Seed =", s))
})

Perplexity plots

ggarrange(plotlist = perplexity_plots, ncol = 3, common.legend = TRUE, legend = "right")

Iteration plots

ggarrange(plotlist = iteration_plots, ncol = 3, common.legend = TRUE, legend = "right")

Seed plots

ggarrange(plotlist = seed_plots, ncol = 3, common.legend = TRUE, legend = "right")

tSNE parameters interpretation:

When we used a low perplexity like 5, the groups were very clearly separated, but the overall shape of the plot looked a bit distorted. With perplexity 30, the structure was much clearer and the groups were still well defined, so this value gave us the best result. Perplexity 50 still showed clear separation between groups, but the clusters were more spread out internally, making the plot slightly harder to interpret. So, higher perplexity didn’t mix the groups, but it reduced the sharpness of the clusters.

The number of iterations had a big impact on the result. With only 200 iterations, the algorithm didn’t have enough time to organize the data properly, so everything looked collapsed in one spot. At 500 iterations the result improved, but wasn’t perfect yet. When we used 1000 iterations, the groups were much clearer and well separated, showing that more iterations help tSNE reach a stable and meaningful structure.

Changing the random seed didn’t affect how the samples were grouped, but it did change the overall layout of the plot. In all three cases, the treatment and cell type clusters were consistent, which means the results are stable. Using a fixed seed is useful to make sure the results can be repeated and compared easily, even if the plot looks slightly different.

QUESTION 5: Final interpretation

pca_final_df <- run_pca_with_metadata(norm_expr, final_data, TRUE)

tsne_final_df <- run_tsne_with_metadata(norm_expr, final_data, perplexity = 30, max_iter = 1000, seed = 42)

# and this is to plot both with shared legend
p_final_pca <- plot_pca_reduction(pca_final_df, title = "Final PCA on scaled data", shape_by = "replicate")
p_final_tsne <- plot_tsne_reduction(tsne_final_df, title = "Final tSNE on scaled data", shape_by = "replicate")

ggarrange(p_final_pca, p_final_tsne, ncol = 2, common.legend = TRUE, legend = "right")

PCA and tSNE final interpretation:

To finish the analysis, we compared the results of PCA and tSNE using the final dataset with relative normalization, log-transformation, and scaling. Both plots showed a very consistent structure, with clear separation between treatment groups (DMSO, PMA, SAHA) and tight clustering of replicates. This suggests that our normalization worked well and removed potential batch effects.

In the PCA plot, most of the variation is explained by PC1, which clearly separates the three treatments. PC2 adds some extra structure that seems related to cell type. Replicates (represented by different shapes) are close to each other within each treatment, confirming technical consistency.

The tSNE plot shows a very similar distribution, with the same separation between treatments and strong clustering by replicate and cell type. Since tSNE is non-linear, the layout is different, but the biological interpretation remains the same.

Additionally, linear models confirmed that treatment has a highly significant effect on both PC2 (p-value < 1.39e-05) and especially PC1(p-value < 2e-16), while replicates and cell type also contribute but to a lesser extent. This validates our choice to color plots by treatment, as it is the most explanatory variable in the dataset.

We didn’t detect any strong outliers in the dataset, and all samples formed well-defined clusters.

In conclusion, PCA and tSNE both captured the main biological structure of the dataset. They agree in separating treatments and grouping replicates, which gives us confidence in the quality of the data and the analysis.

QUESTION 6: UMAP interpretation

Raw counts

By cell type

umap_df_raw <- run_umap_with_metadata(expr_raw, final_data)
plot_umap_reduction(umap_df_raw, title = "UMAP on raw counts (shape: cell type)", shape_by = "cell_type")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

By replicate

plot_umap_reduction(umap_df_raw, title = "UMAP (shape: replicate)", shape_by = "replicate")

Normalized expression

By cell type

umap_df_scaled <- run_umap_with_metadata(norm_expr, final_data)
plot_umap_reduction(umap_df_scaled, title = "UMAP on normalized expression (shape: cell type)", shape_by = "cell_type")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

By replicate

plot_umap_reduction(umap_df_scaled, title = "UMAP on normalized data (shape: replicate)", shape_by = "replicate")

Linear model summary

summary(lm(UMAP1~replicate, umap_df_scaled))
## 
## Call:
## lm(formula = UMAP1 ~ replicate, data = umap_df_scaled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.637  -9.916   5.436   6.300   7.964 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -0.5711     0.8503  -0.672    0.502
## replicaterep2   0.7651     1.1959   0.640    0.523
## replicaterep3   0.9480     1.2059   0.786    0.432
## 
## Residual standard error: 8.067 on 268 degrees of freedom
## Multiple R-squared:  0.0026, Adjusted R-squared:  -0.004843 
## F-statistic: 0.3493 on 2 and 268 DF,  p-value: 0.7055

UMAP interpretation

We applied UMAP to both the raw count data and the normalized expression data to evaluate how preprocessing affects dimensionality reduction outcomes.

UMAP on raw counts showed a strong and well-defined separation between treatment groups (DMSO, PMA, SAHA). Clusters appeared tight, but some internal variability was noticeable, especially within the same treatment group. This suggests that while the treatment effect is visible, technical artifacts or differences in sequencing depth might still be influencing the layout, as raw counts often contain large-scale variance due to technical factors.

When using cell type as shape, UMAP clearly separated Jurkat and J-LatA2 samples within each treatment. However, the separation was not as compact, and some internal variability was noticeable within treatments, particularly for PMA and DMSO, where replicates from the same treatment show slight shifts. This could reflect subtle biological differences between replicates or technical noise in raw counts.

After normalization, the biological structure became clearer. The treatment groups were sharply separated, and clusters were more symmetric and biologically interpretable. Using cell type as shape, J-LatA2 and Jurkat remained well distinguished, and the layout showed reduced noise, especially within groups.

When switching to replicate as shape, the clusters remained intact and replicates were consistently grouped within their corresponding treatment clusters. The layout was clean and interpretable, suggesting that technical batch effects were effectively mitigated by the normalization.

To support this, we fit a linear model testing whether the UMAP1 coordinate could be explained by the replicate variable (lm(UMAP1 ~ replicate)). The result showed no significant effect (p = 0.5469), confirming that replicate does not introduce systematic variation in the UMAP space after normalization.

No strong outliers were detected in either representation. All points clustered within the expected biological conditions.

QUESTION 7: Predict the data

# this sets a seed to make the sampling reproducible
set.seed(42)

# this randomly selects 80% of the rows for training
sample_indices <- sample(1:nrow(norm_expr), size = 0.8 * nrow(norm_expr))

# this creates the training expression matrix (80%)
expr_train <- norm_expr[sample_indices, ]

# this creates the test expression matrix (20%)
expr_test <- norm_expr[-sample_indices, ]

# this splits the metadata to match the training samples
metadata_train <- final_data[sample_indices, ]

# this splits the metadata to match the test samples
metadata_test <- final_data[-sample_indices, ]

# this runs UMAP on the training data and projects the test samples into the learned space
umap_prediction_df <- project_umap(expr_train, expr_test, metadata_train, metadata_test)

# this is the plot that shows both training and test samples
# training samples are circles, test samples are crosses
plot_umap_reduction(umap_prediction_df, title = "UMAP projection: train vs. predicted test samples", shape_by = "set")

### UMAP projection interpretation:

We trained a UMAP model on 80% of the dataset and used it to project the remaining 20%. In the resulting plot, training samples are shown as circles and projected test samples as crosses. The predicted test points closely followed the structure learned from the training set, clearly preserving the treatment-based clusters (DMSO, PMA, and SAHA).

This tight alignment between test and train embeddings indicates that the UMAP model generalized well to unseen samples. No test samples appeared as outliers or were projected into unexpected regions, further confirming the robustness of the learned manifold.

The preservation of biological grouping — especially the distinct separation between SAHA, PMA, and DMSO — suggests that the dimensionality reduction learned meaningful structure from the gene expression data. This validates the stability of the UMAP embedding and supports its use for generalization and downstream interpretation.

QUESTION #8: UMAP Exploration parameters

Number of Neighbors

neighbor_plots[[1]]  # neighbors = 5

neighbor_plots[[2]]  # neighbors = 15

neighbor_plots[[3]]  # neighbors = 50

Minimum Distance

mindist_plots[[1]]  # min_dist = 0.001

mindist_plots[[2]]  # min_dist = 0.1

mindist_plots[[3]]  # min_dist = 0.5

Distance Metrics

metric_plots[[1]]  # metric = euclidean

metric_plots[[2]]  # metric = manhattan

metric_plots[[3]]  # metric = cosine

UMAP Parameter Exploration – Interpretation

We explored how UMAP’s key parameters (n_neighbors, min_dist, and metric) affect the projection of the gene expression data and the biological interpretability of sample clustering.

Number of Neighbors (n_neighbors):

  • With k = 5, UMAP revealed sharp and compact local clusters, but the overall structure appeared fragmented. For example, PMA and DMSO formed two disjoint subclusters each, with some replicate separation inside treatments. This setting emphasizes local relationships but distorts the global geometry.

  • With k = 15 (default), treatment clusters became more coherent and better defined. Replicates within each group showed more consistency, and the overall layout was biologically meaningful. This value provided a good trade-off between local and global structure.

  • With k = 50, treatment clusters (especially SAHA and PMA) were clearly separated and well aligned. However, within-cluster structure was more compressed, and finer patterns among replicates were slightly lost. This suggests a prioritization of global coherence over local detail.

Minimum Distance (min_dist):

  • A very small value (min_dist = 0.001) produced extremely compact clusters, making group boundaries sharp. While this helps highlight strong treatment separation, it can exaggerate distances between clusters and potentially distort global relationships.

  • min_dist = 0.1 offered a well-balanced layout. It preserved compact clusters and maintained reasonable inter-cluster distances, enabling better interpretation of both treatment and replicate effects.

  • At min_dist = 0.5, clusters appeared more spread out and partially overlapped. Replicates within treatments were still cohesive, but some visual separation was lost, reducing interpretability of distinct biological groups.

Distance Metric (metric):

  • Using euclidean distance yielded a familiar layout, similar to PCA, with clear treatment separation and tight replicates, but not the sharpest boundaries.

  • manhattan distance enhanced the delineation of clusters, especially in SAHA and PMA. It provided slightly better replicate consistency and clearer inter-group margins than euclidean.

  • cosine distance showed the best overall performance: DMSO, PMA, and SAHA formed three highly distinct, compact, and stable clusters. This metric captured the biological structure most effectively and minimized overlap or replicate variability.

To sum up, the most biologically meaningful and technically robust UMAP embedding was achieved with: n_neighbors = 50, min_dist = 0.1, and metric = “cosine”

This configuration provided strong treatment-based separation while preserving replicate consistency and minimizing technical noise. In contrast, very low n_neighbors or high min_dist reduced cluster clarity. The choice of distance metric had a substantial impact, with cosine clearly outperforming euclidean and manhattan in this dataset.

To evaluate technical consistency, we encoded replicates as shapes, which revealed that most parameter settings preserved within-treatment coherence across replicates. This supports the robustness of the embeddings and confirms that treatment is the dominant source of variation.

Interactive UMAP plot with optimized parameters

# this is to re-run UMAP with best parameter combination based on the previous exploration
umap_best <- run_umap_with_metadata(norm_expr, final_data,
                                    n_neighbors = 50,
                                    min_dist = 0.1,
                                    metric = "cosine")

# to plot it with shapes by replicate
umap_plot_interactive <- ggplot(umap_best, aes(x = UMAP1, y = UMAP2, color = treatment, shape = replicate,
                                               text = paste("Sample:", sample,
                                                            "<br>Treatment:", treatment,
                                                            "<br>Cell type:", cell_type,
                                                            "<br>Replicate:", replicate))) +
  geom_point(size = 2, alpha = 0.8) +
  scale_shape_manual(values = c("rep1" = 3, "rep2" = 16, "rep3" = 12)) +
  theme_minimal() +
  labs(title = "Interactive UMAP with Optimized Parameters", x = "UMAP1", y = "UMAP2")

ggplotly(umap_plot_interactive, tooltip = "text")

UMAP optimized interpretation:

Using the optimal combination of parameters derived from our previous exploration, UMAP reveals highly distinct treatment clusters (DMSO, PMA, SAHA) with minimal overlap. This version preserves both global structure and local detail better than the default settings. The internal consistency of replicates is also strong, suggesting high biological coherence.

This optimized UMAP plot supports our previous findings and gives additional confidence in the robustness of dimensionality reduction when appropriate parameters are selected.

QUESTION 9: Final interpretation: PCA vs. tSNE vs. UMAP

PCA (by cell type)

plot_pca_cell_type

tSNE (by cell type)

plot_tsne_cell_type

UMAP (by cell type)

plot_umap_cell_type

PCA (by replicate)

plot_pca_replicate

tSNE (by replicate)

plot_tsne_replicate

UMAP (by replicate)

plot_umap_replicate

Final interpretation: PCA vs. tSNE vs. UMAP

To conclude our exploratory analysis, we compared the results of PCA, tSNE, and UMAP using the final normalized and scaled dataset. We visualized the projections using the most relevant metadata variables: treatment (DMSO, PMA, SAHA), cell type (J-LatA2, Jurkat), and technical replicate (rep1, rep2, rep3). All three techniques captured meaningful biological variation, but they differ in the type of structure they reveal:

  • PCA revealed a clear linear separation of the three treatment groups. PC1 mostly separated SAHA from the rest, while PC2 added additional separation between DMSO and PMA. Interestingly, DMSO and PMA appeared relatively closer in PCA space, suggesting a greater similarity between these two treatments compared to SAHA, which consistently formed a distinct cluster. Clusters were compact and well aligned with both treatment and cell type, suggesting strong global trends. When visualized by replicate, samples from the same condition but different replicates clustered tightly, indicating that normalization effectively removed batch effects.

  • tSNE revealed more detailed local structure, with well-separated and compact clusters for each treatment. Replicates were tightly grouped, confirming high internal consistency. The method preserved the separation between treatments and also highlighted substructures within treatments, indicating heterogeneity within each group. This supports the idea that tSNE excels at preserving local neighborhoods while minimizing technical noise.

  • UMAP, using optimized parameters also captured clear biological groupings. Treatment groups were sharply separated, especially SAHA, which formed a well-defined and isolated cluster. PMA and DMSO were still distinguishable but appeared closer to each other, especially in the UMAP1 axis, which may reflect their more similar biological effect compared to SAHA. Cell types remained clearly separated within each treatment. Replicates clustered consistently, with minimal dispersion, confirming technical stability. Compared to PCA and tSNE, UMAP provided a balanced compromise between local and global structure, though its layout was slightly more sensitive to parameter tuning.

We did not observe any strong outliers in any of the three methods. All samples were embedded within reasonable distances of their respective groups, with no consistent deviations across plots.

To summarize:

  • PCA is most useful for summarizing global trends and detecting batch effects.
  • tSNE is ideal for revealing local structures and treatment-level heterogeneity.
  • UMAP, when properly tuned, offers a versatile view that combines global separation and replicate consistency.

All three methods led to consistent biological conclusions: treatment is the dominant source of variation, followed by cell type. Technical replicates were stable across all methods, validating the effectiveness of our preprocessing pipeline. Therefore, these results provide a solid foundation for downstream differential expression analysis.